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Abstract. Compressive sampling offers a new paradigm for acquiring signals that are compressible 
qq with respect to an orthonormal basis. The major algorithmic challenge in compressive sampling 

is to approximate a compressible signal from noisy samples. This paper describes a new iterative 
recovery algorithm called CoSaMP that delivers the same guarantees as the best optimization-based 
£SJ approaches. Moreover, this algorithm offers rigorous bounds on computational cost and storage. 

3 I It is likely to be extremely efficient for practical problems because it requires only matrix-vector 

O | multiplies with the sampling matrix. For compressible signals, the running time is just 0(N log 2 AT), 

where N is the length of the signal. 

o 
t— i 

1. Introduction 

Most signals of interest contain scant information relative to their ambient dimension, but the 
classical approach to signal acquisition ignores this fact. We usually collect a complete represen- 
tation of the target signal and process this representation to sieve out the actionable information. 
Then we discard the rest. Contemplating this ugly inefficiency, one might ask if it is possible 
instead to acquire compressive samples. In other words, is there some type of measurement that 
automatically winnows out the information from a signal? Incredibly, the answer is sometimes yes. 

Compressive sampling refers to the idea that, for certain types of signals, a small number of 
^ nonadaptive samples carries sufficient information to approximate the signal well. Research in this 

area has two major components: 

Sampling: How many samples are necessary to reconstruct signals to a specified precision? 
What type of samples? How can these sampling schemes be implemented in practice? 

CO 

Reconstruction: Given the compressive samples, what algorithms can efficiently construct 
OO a signal approximation? 

The literature already contains a well-developed theory of sampling, which we summarize below. 
Although algorithmic work has been progressing, the state of knowledge is less than complete. We 
^ assert that a practical signal reconstruction algorithm should have all of the following properties. 

• It should accept samples from a variety of sampling schemes. 

• It should succeed using a minimal number of samples. 

• It should be robust when samples are contaminated with noise. 

• It should provide optimal error guarantees for every target signal. 

• It should offer provably efficient resource usage. 

To our knowledge, no approach in the literature simultaneously accomplishes all five goals. 
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This paper presents and analyzes a novel signal reconstruction algorithm that achieves these 
desiderata. The algorithm is called CoSaMP, from the acrostic Compressive Sampling Matching 
Pursuit. As the name suggests, the new method is ultimately based on orthogonal matching pursuit 
(OMP) [36], but it incorporates several other ideas from the literature to accelerate the algorithm 
and to provide strong guarantees that OMP cannot. Before we describe the algorithm, let us deliver 
an introduction to the theory of compressive sampling. 

1.1. Rudiments of Compressive Sampling. To enhance intuition, we focus on sparse and com- 
pressible signals. For vectors in C , define the £q quasi-norm 

||x|| = |supp(x)| = \{j : xj / 0}| . 

We say that a signal x is s-sparse when ||x|| < s. Sparse signals are an idealization that we 
do not encounter in applications, but real signals are quite often compressible, which means that 
their entries decay rapidly when sorted by magnitude. As a result, compressible signals are well 
approximated by sparse signals. We can also talk about signals that are compressible with respect 
to other orthonormal bases, such as a Fourier or wavelet basis. In this case, the sequence of 
coefficients in the orthogonal expansion decays quickly. It represents no loss of generality to focus 
on signals that are compressible with respect to the standard basis, and we do so without regret. 
For a more precise definition of compressibility, turn to Section 2.6. 

In the theory of compressive sampling, a sample is a linear functional applied to a signal. The 
process of collecting multiple samples is best viewed as the action of a sampling matrix $ on the 
target signal. If we take m samples, or measurements, of a signal in C^, then the sampling matrix 
$ has dimensions m x N. A natural question now arises: How many measurements are necessary 
to acquire s-sparse signals? 

The minimum number of measurements m > 2s on account of the following simple argument. 
The sampling matrix must not map two different s-sparse signals to the same set of samples. 
Therefore, each collection of 2s columns from the sampling matrix must be nonsingular. It is easy 
to see that certain Vandermonde matrices satisfy this property, but these matrices are not really 
suitable for signal acquisition because they contain square minors that are very badly conditioned. 
As a result, some sparse signals are mapped to very similar sets of samples, and it is unstable to 
invert the sampling process numerically. 

Instead, Candes and Tao proposed the stronger condition that the geometry of sparse signals 
should be preserved under the action of the sampling matrix [6] . To quantify this idea, they defined 
the rth restricted isometry constant of a matrix $ as the least number S r for which 

(1 — S r ) \\x\\ 2 < ||^a5||2 < (1 + S r ) \\x\\2 whenever ||x|| < r. (1-1) 

We have written ||-|| 2 for the £2 vector norm. When 5 r < 1, these inequalities imply that each 
collection of r columns from <I> is nonsingular, which is the minimum requirement for acquiring 
(r/2)-sparse signals. When 5 r <C 1, the sampling operator very nearly maintains the £2 distance 
between each pair of (r/2)-sparse signals. In consequence, it is possible to invert the sampling 
process stably. 

To acquire s-sparse signals, one therefore hopes to achieve a small restricted isometry constant 
62s with as few samples as possible. A striking fact is that many types of random matrices have 
excellent restricted isometry behavior. For example, we can often obtain 62s < 0.1 with 

m = 0(s log" N) 

measurements, where a is a small integer. Unfortunately, no deterministic sampling matrix is 
known to satisfy a comparable bound. Even worse, it is computationally difficult to check the 
inequalities (1.1), so it may never be possible to exhibit an explicit example of a good sampling 
matrix. 
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As a result, it is important to understand how random sampling matrices behave. The two 
quintessential examples are Gaussian matrices and partial Fourier matrices. 

Gaussian matrices: If the entries of \/m<& are independent and identically distributed stan- 
dard normal variables then 

m> Crk «W f) - S r <s (1.2) 

except with probability e~ cm . See [6] for details. 



Partial Fourier matrices: If y/m<& is a uniformly random set of m rows drawn from the 
N x N unitary discrete Fourier transform (DFT), then 

CrWiV-log^- 1 ) 
m > ^ =>■ o r < e (1.3) 

except with probability iV -1 . See [32] for the proof. Experts believe that the power on the 
first logarithm should be no greater than two [30] . 

Here and elsewhere, we follow the analyst's convention that upright letters (c, C, . . . ) refer to 
positive, universal constants that may change from appearance to appearance. 

The Gaussian matrix is important because it has optimal restricted isometry behavior. Indeed, 
for any m x N matrix, 

5 r <0.1 => m > Crlog(AT/r), 

on account of profound geometric results of Kashin [23] and Garnaev-Gluskin [15] . Even though 
partial Fourier matrices may require additional samples to achieve a small restricted isometry 
constant, they are more interesting for the following reasons [2]. 

• There are technologies that acquire random Fourier measurements at unit cost per sample. 

• The sampling matrix can be applied to a vector in time 0(iVlog N). 

• The sampling matrix requires only O(mlogiV) storage. 

Other types of sampling matrices, such as the random demodulator [35], enjoy similar qualities. 
These traits are essential for the translation of compressive sampling from theory into practice. 

1.2. Signal Recovery Algorithms. The major algorithmic challenge in compressive sampling is 
to approximate a signal given a vector of noisy samples. The literature describes a huge number of 
approaches to solving this problem. They fall into three rough categories: 

Greedy pursuits: These methods build up an approximation one step at a time by mak- 
ing locally optimal choices at each step. Examples include OMP [36], stagewise OMP 
(StOMP) [13], and regularized OMP (ROMP) [28, 27]. 

Convex relaxation: These techniques solve a convex program whose minimizer is known 
to approximate the target signal. Many algorithms have been proposed to complete the 
optimization, including interior-point methods [2, 24], projected gradient methods [14], and 
iterative thresholding [11]. 

Combinatorial algorithms: These methods acquire highly structured samples of the sig- 
nal that support rapid reconstruction via group testing. This class includes Fourier sam- 
pling [19, 21], chaining pursuit [16], and HHS pursuit [17], as well as some algorithms of 
Cormode-Muthukrishnan [9] and Iwen [22]. 

At present, each type of algorithm has its native shortcomings. Many of the combinatorial 
algorithms are extremely fast — sublinear in the length of the target signal — but they require a 
large number of somewhat unusual samples that may not be easy to acquire. At the other extreme, 
convex relaxation algorithms succeed with a very small number of measurements, but they tend 
to be computationally burdensome. Greedy pursuits — in particular, the ROMP algorithm — are 
intermediate in their running time and sampling efficiency 
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CoSaMP, the algorithm described in this paper, is at heart a greedy pursuit. It also incorporates 
ideas from the combinatorial algorithms to guarantee speed and to provide rigorous error bounds 
[17]. The analysis is inspired by the work on ROMP [28, 27] and the work of Candes-Flomberg-Tao 
[3] on convex relaxation methods. In particular, we establish the following result. 

Theorem A (CoSaMP). Suppose that $ is an m x N sampling matrix with restricted isometry 
constant 82s < c. Let u = &x + e be a vector of samples of an arbitrary signal, contaminated with 
arbitrary noise. For a given precision parameter rj, the algorithm CoSaMP produces a 2s-sparse 
approximation a that satisfies 



where x s is a best s-sparse approximation to x. The running time is 0(Jz? • log(||x|| 2 /v))> where 
Jz? bounds the cost of a matrix-vector multiply with $ or The working storage use is O(N). 

Let us expand on the statement of this result. First, recall that many types of random sampling 
matrices satisfy the restricted isometry hypothesis when the number of samples m = 0(slog a N). 
Therefore, the theorem applies to a wide class of sampling schemes when the number of samples is 
proportional to the target sparsity and logarithmic in the ambient dimension of the signal space. 

The algorithm produces a 2s-sparse approximation whose £2 error is comparable with the scaled 
t\ error of the best s-sparse approximation to the signal. Of course, the algorithm cannot resolve 
the uncertainty due to the additive noise, so we also pay for the energy in the noise. This type of 
error bound is structurally optimal, as we discuss in Section 2.6. Some disparity in the sparsity 
levels (here, 2s versus s) seems to be necessary when the recovery algorithm is computationally 
efficient [31]. 

We can interpret the error guarantee as follows. In the absence of noise, the algorithm can recover 
an s-sparse signal to arbitrarily high precision. Performance degrades gracefully as the energy in 
the noise increases. Performance also degrades gracefully for compressible signals. The theorem 
is ultimately vacuous for signals that cannot be approximated by sparse signals, but compressive 
sampling is not an appropriate technique for this class. 

The running time bound indicates that each matrix-vector multiplication reduces the error by 
a constant factor (if we amortize over the entire execution). That is, the algorithm has linear 
convergence 1 . We find that the total runtime is roughly proportional to (the negation of) the 
reconstruction signal-to-noise ratio 



For compressible signals, one can show that |R-SNR| = O(logs). The runtime is also proportional 
to the cost of a matrix-vector multiply. For sampling matrices with a fast multiply, the algorithm 
is accelerated substantially. In particular, for the partial Fourier matrix, a matrix-vector multiply 
requires time 0(Nlog N). It follows that the total runtime is 0(NlogN • |R-SNR|). For most 
signals of interest, this cost is nearly linear in the signal length! 

1.3. Notation. Let us instate several pieces of notation that are carried throughout the paper. For 
p G [l,oo], we write ||-|| for the usual £ p vector norm. We reserve the symbol ||-|| for the spectral 
norm, i.e., the natural norm on linear maps from £2 to £2- 

Suppose that a; is a signal in and r is a positive integer. We write x r for the signal in C N that 
is formed by restricting x to its r largest-magnitude components. Ties are broken lexicographically. 



Mathematicians sometimes refer to linear convergence as "exponential convergence." 



x — a|| 2 < C • max < rj, \\x — Xg^ + \\e 






dB. 
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This signal is a best r-sparse approximation to x with respect to any t v norm. Suppose now that 
T is a subset of {1, 2, ... , N}. We define the restriction of the signal to the set T as 



We occasionally abuse the notation and treat x\t as an element of the vector space C . We also 
define the restriction <&t of the sampling matrix <I> as the column submatrix whose columns are 
listed in the set T. 

Finally, we define the pseudoinverse of a tall, full-rank matrix A by the formula A' = (A* A) -1 A* . 

1.4. Organization. The rest of the paper has the following structure. In Section 2 we introduce the 
CoSaMP algorithm, we state the major theorems in more detail, and we discuss implementation and 
resource requirements. Section 3 describes some consequences of the restricted isometry property 
that pervade our analysis. The central theorem is established for sparse signals in Sections 4 and 5. 
We extend this result to general signals in Section 6. Finally, Section 7 places the algorithm in 
the context of previous work. The first appendix presents variations on the algorithm. The second 
appendix contains a bound on the number of iterations required when the algorithm is implemented 
using exact arithmetic. 



This section gives an overview of the algorithm, along with explicit pseudocode. It presents the 
major theorems on the performance of the algorithm. Then it covers details of implementation and 
bounds on resource requirements. 

2.1. Intuition. The most difficult part of signal reconstruction is to identify the locations of the 
largest components in the target signal. CoSaMP uses an approach inspired by the restricted 
isometry property. Suppose that the sampling matrix $ has restricted isometry constant S s <C 1. 
For an s-sparse signal x, the vector y = &*<&x can serve as a proxy for the signal because the 
energy in each set of s components of y approximates the energy in the corresponding s components 
of x. In particular, the largest s entries of the proxy y point toward the largest s entries of the 
signal x. Since the samples have the form u = <&x, we can obtain the proxy just by applying the 
matrix to the samples. 

The algorithm invokes this idea iteratively to approximate the target signal. At each iteration, 
the current approximation induces a residual, the part of the target signal that has not been 
approximated. As the algorithm progresses, the samples are updated so that they reflect the 
current residual. These samples are used to construct a proxy for the residual, which permits us 
to identify the large components in the residual. This step yields a tentative support for the next 
approximation. We use the samples to estimate the approximation on this support set using least 
squares. This process is repeated until we have found the recoverable energy in the signal. 

2.2. Overview. As input, the CoSaMP algorithm requires four pieces of information: 

• Access to the sampling operator via matrix-vector multiplication. 

• A vector of (noisy) samples of the unknown signal. 

• The sparsity of the approximation to be produced. 

• A halting criterion. 

The algorithm is initialized with a trivial signal approximation, which means that the initial residual 
equals the unknown target signal. During each iteration, CoSaMP performs five major steps: 

(1) Identification. The algorithm forms a proxy of the residual from the current samples and 
locates the largest components of the proxy. 




0, otherwise. 



2. The CoSaMP Algorithm 
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(2) Support Merger. The set of newly identified components is united with the set of com- 
ponents that appear in the current approximation. 

(3) Estimation. The algorithm solves a least-squares problem to approximate the target signal 
on the merged set of components. 

(4) Pruning. The algorithm produces a new approximation by retaining only the largest 
entries in this least-squares signal approximation. 

(5) Sample Update. Finally, the samples are updated so that they reflect the residual, the 
part of the signal that has not been approximated. 

These steps are repeated until the halting criterion is triggered. In the body of this work, we 
concentrate on methods that use a fixed number of iterations. Appendix A discusses some other 
simple stopping rules that may also be useful in practice. 

Pseudocode for CoSaMP appears as Algorithm 2.1. This code describes the version of the al- 
gorithm that we analyze in this paper. Nevertheless, there are several adjustable parameters that 
may improve performance: the number of components selected in the identification step and the 
number of components retained in the pruning step. For a brief discussion of other variations on 
the algorithm, turn to Appendix A. 



Algorithm 2.1: CoSaMP Recovery Algorithm 



CoSaMP($, u, s) 

Input: Sampling matrix noisy sample vector u, sparsity level s 
Output: An s-sparse approximation a of the target signal 




{ Trivial initial approximation } 


v <— u 


{ Current samples = input samples } 


k <- 




repeat 




k <- k + 1 




y <— <&*v 


{ Form signal proxy } 


tt <- supp(y 2s ) 


{ Identify large components } 


T^^Usupp(a fc - 1 ) 


{ Merge supports } 




{ Signal estimation by least-squares } 


b p <— 




a k <- b s 


{ Prune to obtain next approximation } 


v <— u — $a fc 


{ Update current samples } 


until halting criterion true 
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2.3. Performance Guarantees. This section describes our theoretical analysis of the behavior of 
CoSaMP. The next section covers the resource requirements of the algorithm. Afterward, Section 2.5 
combines these materials to establish Theorem A. 

Our results depend on a set of hypotheses that has become common in the compressive sampling 
literature. Let us frame the standing assumptions: 



CoSaMP Hypotheses 

• The sparsity level s is fixed. 

• The m x N sampling operator <I> has restricted isometry constant 5^ < 0.1. 

• The signal x G C N is arbitrary, except where noted. 

• The noise vector e £ C m is arbitrary. 

• The vector of samples u = <&x + e. 

We also define the unrecoverable energy v in the signal. This quantity measures the baseline error 
in our approximation that occurs because of noise in the samples or because the signal is not sparse. 



I *^ ^^112 I — 11*^ 11^112 ' 

V s 



We postpone a more detailed discussion of the unrecoverable energy until Section 2.6. 

Our key result is that CoSaMP makes significant progress during each iteration where the ap- 
proximation error is large relative to the unrecoverable energy. 

Theorem 2.1 (Iteration Invariant). For each iteration k > 0, the signal approximation a k is 
s-sparse and 



In particular, 



\x - a k+1 \\ 2 < 0.5\\x - a k \\ 2 + IOza 



\x - a k \\ 2 < 2~ k \\x\\ 2 + 20v. 



x 



The proof of Theorem 2.1 will occupy us for most of this paper. In Section 4, we establish an 
analog for sparse signals. The version for general signals appears as a corollary in Section 6. 

Theorem 2.1 has some immediate consequences for the quality of reconstruction with respect to 
standard signal metrics. In this setting, a sensible definition of the signal-to-noise ratio (SNR) is 

SNR = 101og 10 ' ' 

The reconstruction SNR is defined as 

R-SNR = 101og 10 

Both quantities are measured in decibels. Theorem 2.1 implies that, after k iterations, the recon- 
struction SNR satisfies 

R-SNR < 3 - min{3/c, SNR - 13}. 

In words, each iteration reduces the reconstruction SNR by about 3 decibels until the error nears 
the noise floor. To reduce the error to its minimal value, the number of iterations is proportional 
to the SNR. 

Let us consider a slightly different scenario. Suppose that the signal x is s-sparse, so the unre- 
coverable energy v = ||e|| 2 . Define the dynamic range 

( max \xA \ 

A = 10 log in . — - where 1 ranges over supp(a;). 

V mm \Xj\ I 
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Assume moreover that the minimum nonzero component of the signal is at least 40za Using the 
fact that ||x|| 2 < \fs H^Hoo, it is easy to check that ||jc — a\\ 2 < min \xi\ as soon as the number k of 
iterations satisfies 

k > 3.3A + log 2V / s + 1- 

It follows that the support of the approximation a must contain every entry in the support of the 
signal x. 

This discussion suggests that the number of iterations might be substantial if we require a very 
low reconstruction SNR or if the signal has a very wide dynamic range. This initial impression is 
not entirely accurate. We have established that, when the algorithm performs arithmetic to high 
enough precision, then a fixed number of iterations suffices to reduce the approximation error to 
the same order as the unrecoverable energy. Here is a result for exact computations. 

Theorem 2.2 (Iteration Count). Suppose that CoSaMP is implemented with exact arithmetic. 
After at most 6(s + 1) iterations, CoSaMP produces an s-sparse approximation a that satisfies 

\\x — a\\ 2 < 20v. 

In fact, even more is true. The number of iterations depends significantly on the structure of 
the signal. The only situation where the algorithm needs fi(s) iterations occurs when the entries 
of the signal decay exponentially. For signals whose largest entries are comparable, the number of 
iterations may be as small as O(logs). This claim is quantified in Appendix B, where we prove 
Theorem 2.2. 

If we solve the least-squares problems to high precision, an analogous result holds, but the 
approximation guarantee contains an extra term that comes from solving the least-squares problems 
imperfectly. In practice, it may be more efficient overall to solve the least-squares problems to 
low precision. The correct amount of care seems to depend on the relative costs of forming the 
signal proxy and solving the least-squares problem, which are the two most expensive steps in the 
algorithm. We discuss this point in the next section. Ultimately, the question is best settled with 
empirical studies. 

Remark 2.3. In the hypotheses, a bound on the restricted isometry constant 62s also suffices. 
Indeed, Corollary 3.4 of the sequel implies that 5^ < 0.1 holds whenever 82s < 0.025. 

Remark 2.4. The expression (2.1) for the unrecoverable energy can be simplified using Lemma 7 
from [17], which states that, for every signal y G and every positive integer t, we have 

Ilv-I*ll 2 <^llvlli- 

Choosing y = x — x s j 2 and t = s/2, we reach 

1-71 1 1 , x 

v< — ^||x-a; a /2|| 1 + ||e|| 2 . (2.2) 

In words, the unrecoverable energy is controlled by the scaled £1 norm of the signal tail. 

2.4. Implementation and Resource Requirements. CoSaMP was designed to be a practical 
method for signal recovery. An efficient implementation of the algorithm requires some ideas from 
numerical linear algebra, as well as some basic techniques from the theory of algorithms. This 
section discusses the key issues and develops an analysis of the running time for the two most 
common scenarios. 

We focus on the least-squares problem in the estimation step because it is the major obstacle 
to a fast implementation of the algorithm. The algorithm guarantees that the matrix 3>r never 
has more than 3s columns, so our assumption 84s < 0.1 implies that the matrix is extremely 
well conditioned. As a result, we can apply the pseudoinverse ^ T = (^fr^r) -1 *^ ver y quickly 
using an iterative method, such as Richardson's iteration [1, Sec. 7.2.3] or conjugate gradient [1, 
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Sec. 7.4]. These techniques have the additional advantage that they only interact with the matrix 
3>T through its action on vectors. It follows that the algorithm performs better when the sampling 
matrix has a fast matrix-vector multiply. 

Section 5 contains an analysis of the performance of iterative least-squares algorithms in the 
context of CoSaMP. In summary, if we initialize the least-squares method with the current approx- 
imation a fe_1 , then the cost of solving the least-squares problem is O(Jzf), where Jz? bounds the 
cost of a matrix- vector multiply with 3>r or *&t- This implementation ensures that Theorem 2.1 
holds at each iteration. 

We emphasize that direct methods for least squares are likely to be extremely inefficient in this 
setting. The first reason is that each least-squares problem may contain substantially different sets 
of columns from As a result, it becomes necessary to perform a completely new QR or SVD 
factorization during each iteration at a cost of 0(s 2 m). The second problem is that computing these 
factorizations typically requires direct access to the columns of the matrix, which is problematic 
when the matrix is accessed through its action on vectors. Third, direct methods have storage costs 
O(sm), which may be deadly for large-scale problems. 

The remaining steps of the algorithm are standard. Let us estimate the operation counts. 

Proxy: Forming the proxy is dominated by the cost of the matrix-vector multiply &*v. 

Identification: We can locate the largest 2s entries of a vector in time O(N) using the 
approach in [8, Ch. 9]. In practice, it may be faster to sort the entries of the signal in 
decreasing order of magnitude at cost 0(iVlog N) and then select the first 2s of them. The 
latter procedure can be accomplished with quicksort, mergesort, or heapsort [8, Sec. II]. To 
implement the algorithm to the letter, the sorting method needs to be stable because we 
stipulate that ties are broken lexicographically. This point is not important in practice. 

Support Merger: We can merge two sets of size O(s) in expected time O(s) using random- 
ized hashing methods [8, Ch. 11]. One can also sort both sets first and use the elementary 
merge procedure [8, p. 29] for a total cost O(slogs). 

LS estimation: We use Richardson's iteration or conjugate gradient to compute ^ T u. Ini- 
tializing the least-squares algorithm requires a matrix- vector multiply with <&y. Each 
iteration of the least-squares method requires one matrix-vector multiply each with *&t 
and <&j.. Since <&t is a submatrix of the matrix- vector multiplies can also be obtained 
from multiplication with the full matrix. We prove in Section 5 that a constant number of 
least-squares iterations suffices for Theorem 2.1 to hold. 

Pruning: This step is similar to identification. Pruning can be implemented in time O(s), 
but it may be preferable to sort the components of the vector by magnitude and then select 
the first s at a cost of O (slogs). 

Sample Update: This step is dominated by the cost of the multiplication of $ with the 
s-sparse vector a k . 

Table 1 summarizes this discussion in two particular cases. The first column shows what happens 
when the sampling matrix $ is applied to vectors in the standard way, but we have random access 
to submatrices. The second, column shows what happens when the sampling matrix $ and its 
adjoint both have a fast multiply with cost Jzf, where we assume that Jz? > N. A typical value 
is Jz? = 0(iVlog N). In particular, a partial Fourier matrix satisfies this bound. 

Finally, we note that the storage requirements of the algorithm are also favorable. Aside from 
the storage required by the sampling matrix, the algorithm constructs only one vector of length 
N, the signal proxy. The sample vectors u and v have length m, so they require O(m) storage. 
The signal approximations can be stored using sparse data structures, so they require at most 
0(slog N) storage. Similarly, the index sets that appear require only 0(slog N) storage. The total 
storage is O(N). 
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Table 1. Operation count for CoSaMP. Big-0 notation is omitted for legibility. 
The dimensions of the sampling matrix $ are m x N; the sparsity level is s. The 
number Jz? bounds the cost of a matrix-vector multiply with <1> or <1>* . 



Step 


Standard multiply 


Fast multiply 


Form proxy 


mN 


Jgf 


Identification 


N 


N 


Support merger 


s 


s 


LS estimation 


sm 


% 


Pruning 


s 


s 


Sample update 


sm 


££ 


Total per iteration 


0(mN) 


0(JS?) 



The following result summarizes this discussion. 

Theorem 2.5 (Resource Requirements). Each iteration of CoSaMP requires 0(Jz?) time, where Jz? 
bounds the cost of a multiplication with the matrix $ or The algorithm uses storage O(N). 

Remark 2.6. We have been able to show that Theorem 2.2 holds when the least-squares prob- 
lems are solved iteratively with a delicately chosen stopping threshold. In this case, the total 
number of least-squares iterations performed over the entire execution of the algorithm is at most 
O (log( 1 1 sc 1 1 2 /r])) if we wish to achieve error 0(r] + v). When the cost of forming the signal proxy 
is much higher than the cost of solving the least-squares problem, this analysis may yield a sharper 
result. For example, using standard matrix-vector multiplication, we have a runtime bound 

0(s • mN + log ( 1 1 a? 1 1 2 /rj) • sm). 

The first term reflects the number of CoSaMP iterations times the cost of forming the signal proxy. 
The second term reflects the total cost of the least-squares iterations. Unless the relative precision 
\\x\\ 2 /rj is superexponential in the signal length, we obtain running time O(smN). This bound is 
comparable with the worst-case cost of OMP or ROMP. As we discuss in Appendix B, the number 
of CoSaMP iterations may be much smaller than s, which also improves the estimate. 

2.5. Proof of Theorem A. We have now collected all the material we need to establish the main 
result. Fix a precision parameter m After at most 0(log(||cc || 2 /rj)) iterations, CoSaMP produces 
an s-sparse approximation a that satisfies 

|| as — a|| 2 < C ■ (77 + v) 

in consequence of Theorem 2.1. Apply inequality (2.2) to bound the unrecoverable energy v in 
terms of the i\ norm. We see that the approximation error satisfies 

||sc — a|| 2 < C • max |ry, —j= ||cc — £c s / 2 1| 2 + Il e ll2| • 

According to Theorem 2.5, each iteration of CoSaMP is completed in time O(Jzf), where Jz? 
bounds the cost of a matrix-vector multiplication with $ or The total runtime, therefore, 
is 0(j£?log(||x|| 2 /r))). The total storage is O(N). 

In the statement of the theorem, perform the substitution s/2 h s. Finally, we replace 5s s with 
62s by means of Corollary 3.4, which states that 5 cr < c ■ 52r for any positive integers c and r. 
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2.6. The Unrecoverable Energy. Since the unrecoverable energy v plays a central role in our 
analysis of CoSaMP, it merits some additional discussion. In particular, it is informative to examine 
the unrecoverable energy in a compressible signal. Let p be a number in the interval (0, 1). We say 
that x is p- compressible with magnitude R if the sorted components of the signal decay at the rate 

M ( i) < R-r 1/p for i = 1,2,3,.... 

When p = 1, this definition implies that \\x\lj_ < R ■ (1 + logiV). Therefore, the unit ball of 
1-compressible signals is similar to the t\ unit ball. When p ~ 0, this definition implies that p- 
compressible signals are very nearly sparse. In general, compressible signals are well approximated 
by sparse signals: 

||a>-a>«Hi < Cp-R-s 1 - 1 ^ 

1 1 *^ *^ s 1 1 2 -^—^p * ' S ^ ^ 

where C p = (1/p — and D p = (2/p — l) -1 / 2 . These results follow by writing each norm as a 
sum and approximating the sum with an integral. We see that the unrecoverable energy (2.1) in a 
p- compressible signal is bounded as 

u <2C P - R - s 1 ^ 1 ^ + \\e\\ 2 . (2.3) 

When p is small, the first term in the unrecoverable energy decays rapidly as the sparsity level s 
increases. For the class of p-compressible signals, the bound (2.3) on the unrecoverable energy is 
sharp, modulo the exact values of the constants. 

With these inequalities, we can see that CoSaMP recovers compressible signals efficiently. Let 
us calculate the number of iterations required to reduce the approximation error from ||aj|| 2 to the 
optimal level (2.3). For compressible signals, the energy ||x|| 2 < 2R, so 

bg ( 2 C p -fl 2 ^/2-Vp ) = log(1/p " 1} + {1/P ~ 1/2) log S - 

Therefore, the number of iterations required to recover a generic p-compressible signal is O(logs), 
where the constant in the big-0 notation depends on p. 

The term "unrecoverable energy" is justified by several facts. First, we must pay for the £2 error 
contaminating the samples. To check this point, define S = supp(a; s ). The matrix $5 is nearly 
an isometry from £ 2 to i™, so an error in the large components of the signal induces an error of 
equivalent size in the samples. Clearly, we can never resolve this uncertainty. 

The term s -1 / 2 ||x — 2J s |li is also required on account of classical results about the Gel'fand widths 
of the li ball in i 2 , due to Kashin [23] and Garnaev-Gluskin [15]. In the language of compressive 
sampling, their work has the following interpretation. Let $ be a fixed m x N sampling matrix. 
Suppose that, for every signal x G C^, there is an algorithm that uses the samples u = $x to 
construct an approximation a that achieves the error bound 

_C_ _ 

Then the number m of measurements must satisfy m > cslog(N/ s). 

3. Restricted Isometry Consequences 

When the sampling matrix satisfies the restricted isometry inequalities (1.1), it has several other 
properties that we require repeatedly in the proof that the CoSaMP algorithm is correct. Our first 
observation is a simple translation of (1.1) into other terms. 
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Proposition 3.1. Suppose $ has restricted isometry constant 5 r . Let T be a set of r indices or 
fewer. Then 



\®tu\\o < \A + *r 



u 



Hu\\ 2 < 7TTTr \\u\\ 2 
\\$t®tx\\ 2 = (1 ± 5 r ) \\x\\ 2 

where the last two statements contain an upper and lower bound, depending on the sign chosen. 

Proof. The restricted isometry inequalities (1.1) imply that the singular values of <&t he between 
\f\ — 5 r and \f\ + 5 r . The bounds follow from standard relationships between the singular values 
of <&t and the singular values of basic functions of <&t- D 

A second consequence is that disjoint sets of columns from the sampling matrix span nearly 
orthogonal subspaces. The following result quantifies this observation. 

Proposition 3.2 (Approximate Orthogonality). Suppose $ has restricted isometry constant S r . 
Let S and T be disjoint sets of indices whose combined cardinality does not exceed r. Then 

||*S*t|| < fir- 

Proof. Abbreviate R = S U T, and observe that <f>* s <f>T is a submatrix of &* r *&r — I. The spectral 
norm of a submatrix never exceeds the norm of the entire matrix. We discern that 

||*S*t|| < \\$*r®R - III < max{(l + S r ) - 1, 1 - (1 - 5 r )} = 6 r 

because the eigenvalues of <&* r <&r lie between 1 — 5 r and 1 + S r . □ 

This result will be applied through the following corollary. 

Corollary 3.3. Suppose $ has restricted isometry constant S r . Let T be a set of indices, and let 
x be a vector. Provided that r > \T U supp(a;)| ; 

||3?T* " 35 1 T c 1 1 2 — $r ||*|t c ||2 ■ 

Proof. Define S = supp(a;) \ T, so we have x\s = x\t^- Thus, 

||*r* " £c|t c || 2 = ll*T* " x \s\\2 ^ ll*T*s|| ll^ls 1 !^ — ^ \\ x \t c \\ 2 j 
owing to Proposition 3.2. □ 

As a second corollary, we show that <$2r gives weak control over the higher restricted isometry 
constants. 

Corollary 3.4. Let c and r be positive integers. Then 5 cr < c • &i r . 

Proof. The result is clearly true for c = 1, 2, so we assume c > 3. Let S be an arbitrary index set 
of size cr, and let M = *&* s *&s ~ I- It suffices to check that ||iW|| < c • for- To that end, we break 
the matrix M into r x r blocks, which we denote Mij. A block version of Gershgorin's theorem 
states that \\M \\ satisfies at least one of the inequalities 

|||M|| - ||Mii||| < y ||Mjd| where i = 1,2, ... ,c. 

The derivation is entirely analogous with the usual proof of Gershgorin's theorem, so we omit the 
details. For each diagonal block, we have < 5 r because of the restricted isometry inequalities 

(1.1). For each off-diagonal block, we have ||Mjj|| < for because of Proposition 3.2. Substitute 
these bounds into the block Gershgorin theorem and rearrange to complete the proof. □ 
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Finally, we present a result that measures how much the sampling matrix inflates nonsparse 
vectors. This bound permits us to establish the major results for sparse signals and then transfer 
the conclusions to the general case. 

Proposition 3.5 (Energy Bound). Suppose that <1> verifies the upper inequality of (1.1), viz. 

\\<&x\\ 2 < y/l + S r \\x\\ 2 when \\x\\ Q < r. 

Then, for every signal x, 

||3>£c|| 2 < \A + S r 



1 

\x\\ 2 + — 



r 



Proof. We repeat the geometric argument of Rudelson that is presented in [17]. 

First, observe that the hypothesis of the proposition can be regarded as a statement about the 
operator norm of $ as a map between two Banach spaces. For a set I C {1, 2, ... , N}, write B 2 
for the unit ball in £2(1). Define the convex body 

5 = conv {U m < r ^} cC ^ 

and notice that, by hypothesis, the operator norm 



|3>lls->2 = max ||3>a;|| 2 < \/l + S r . 



Define a second convex body 



A - < x : ||a;|| 2 + \\x\^ < 1 \ C C A . 



and consider the operator norm 



\&\\ K ^ 2 =max\\$x\\ 2 . 



The content of the proposition is the claim that 

||*||tf_ 2 ^ H*lls^2- 

To establish this point, it suffices to check that K C S. 

Instead, we prove the reverse inclusion for the polars: S° C K° . The norm with unit ball S° is 
calculated as 

Hull qo = max ||u| r\\o . 

Consider a vector u in the unit ball S°, and let 7 be a set of r coordinates where u is largest in 
magnitude. We must have 

NHL 

or else \u{\ > ^ for each i e I. But then \\u\\ so > ||u|/|| 2 > 1, a contradiction. Therefore, we may 
write 

u = u\i + u|/ c e B 2 + ^-Boo, 

Vr 

where B p is the unit ball in 1^. But the set on the right-hand side is precisely the unit ball of K° 
since 

/ \ 1 

su PveK° \ x i v ) = \\ X \\K = \\ X \\2 + \\ X \\l 

= su PveB 2 (a, v) + sup w 1 B {x, w) = sup v B + 1 B (aj, v) . 
In summary, S° C K° . □ 
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4. The Iteration Invariant: Sparse Case 

We now commence the proof of Theorem 2.1. For the moment, let us assume that the signal is 
actually sparse. Section 6 removes this assumption. 

The result states that each iteration of the algorithm reduces the approximation error by a con- 
stant factor, while adding a small multiple of the noise. As a consequence, when the approximation 
error is large in comparison with the noise, the algorithm makes substantial progress in identifying 
the unknown signal. 

Theorem 4.1 (Iteration Invariant: Sparse Case). Assume that x is s-sparse. For each k > 0, the 
signal approximation a k is s-sparse, and 

\\x - a h+1 \\ 2 < 0.5\\x - a k \\ 2 + 7.5 ||e|| 2 . 

In particular, 

\\x - a k \\ 2 < 2~ k \\x\\ 2 + 15 ||e|| 2 . 

The argument proceeds in a sequence of short lemmas, each corresponding to one step in the 
algorithm. Throughout this section, we retain the assumption that x is s-sparse. 

4.1. Approximations, Residuals, etc. Fix an iteration k > 1. We write a = a k ~ l for the signal 
approximation at the beginning of the iteration. Define the residual r = x — a, which we interpret 
as the part of the signal we have not yet recovered. Since the approximation a is always s-sparse, 
the residual r must be 2s-sparse. Notice that the vector v of updated samples can be viewed as 
noisy samples of the residual: 

v = u — $a = <&{x — a) + e = $r + e. 

4.2. Identification. The identification phase produces a set of components where the residual 
signal still has a lot of energy. 

Lemma 4.2 (Identification). The set = supp(y2s) contains at most 2s indices, and 

\\r\o.c\\ 2 < 0.2223 ||r|| 2 + 2.34 ||e|| 2 . 

Proof. The identification phase forms a proxy y = *&*v for the residual signal. The algorithm then 
selects a set O of 2s components from y that have the largest magnitudes. The goal of the proof is 
to show that the energy in the residual on the set Q c is small in comparison with the total energy 
in the residual. 

Define the set R = supp(r). Since R contains at most 2s elements, our choice of £1 ensures that 
II 1/1 fill 2 — 111/ Ml 2- By squaring this inequality and canceling the terms in RnCt, we discover that 

||yU\fi|| 2 < ||yb\fi|| 2 • 

Since the coordinate subsets here contain few elements, we can use the restricted isometry constants 
to provide bounds on both sides. 

First, observe that the set Q\R contains at most 2s elements. Therefore, we may apply Propo- 
sition 3.1 and Corollary 3.3 to obtain 

||y|n\fl|| 2 = ||*n\Ji(* r + c )|| 2 

< ||*Q\R* r || 2 + K\.R e || 2 

< <5 4s ||r|| 2 + \/l + 5 2s ||e|| 2 . 
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Likewise, the set R \ Q contains 2s elements or fewer, so Proposition 3.1 and Corollary 3.3 yield 
||yU\n|| 2 = ||*^ n (*r + e)|| 2 

> ||*H\n* • r U\^ll 2 " ||*fl\n* • r MI 2 " ll*V e ll 2 

> (I- S2s)\\r\ R \ n \\ 2 - 5 2s \\r\\ 2 - y/l + S 2s ||e|| 2 . 

Since the residual is supported on R, we can rewrite r\ R \Q = v\qc Finally, combine the last three 
inequalities and rearrange to obtain 



. (fas + fas) \\r\\ 2 + 2a/I + fas ||e|| 2 

l|r|nc|| 2 < ^ • 

Invoke the numerical hypothesis that <5 2s < fas < 0.1 to complete the argument. □ 

4.3. Support Merger. The next step of the algorithm merges the support of the current signal 
approximation a with the newly identified set of components. The following result shows that 
components of the signal x outside this set have very little energy. 

Lemma 4.3 (Support Merger). Let Q be a set of at most 2s indices. The set T = 17 U supp(a) 
contains at most 3s indices, and 

\\ x \t c \\2 < || r b c |l2 • 

Proof. Since supp(a) C T, we find that 

IMtH^ = \\( x - a )|T c |l2 = ll r |T c |l2 ^ Il r b c ll2 ) 

where the inequality follows from the containment T c C Q c . □ 

4.4. Estimation. The estimation step of the algorithm solves a least-squares problem to obtain 
values for the coefficients in the set T. We need a bound on the error of this approximation. 

Lemma 4.4 (Estimation). Let T be a set of at most 3s indices, and define the least-squares signal 
estimate b by the formulae 

b\x = & T u and 6|r c = 0. 

Then 

\\ x — b|| 2 < 1.112 ||o;|tc|| 2 + 1-06 ||e|| 2 . 

This result assumes that we solve the least-squares problem in infinite precision. In practice, the 
right-hand side of the bound contains an extra term owing to the error from the iterative least- 
squares solver. In Section 5, we study how many iterations of the least-squares solver are required 
to make the least-squares error negligible in the present argument. 

Proof. Note first that 

\\ x — b\\ 2 < 1 1 as I 1 1 2 + \\ x \t ~ b|r|| 2 • 
Using the expression u = <&x + e and the fact <&^<&t = It ; we calculate that 

\\x\ T - b\ T \\ 2 = \\x\t ~ * T (* ' x \t + * ■ x \tc + e)|| 2 
= ||*^(* " x \t c + e )|| 2 
< IK*^)- 1 *^* • a;| r =|| 2 + ||* r e || 2 - 
The cardinality of T is at most 3s, and x is s-sparse, so Proposition 3.1 and Corollary 3.3 imply 
that 

It - &HI2 < 1 ~r ||*r* • x \t4 2 + h 1 f ll e ll 2 
1 - <J3s yl - fas 

, fas ||e|| 2 

< \\x r c U H / ■ 



\x\ 
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Combine the bounds to reach 



\x — b\\ 2 < 



1 + 



1-5: 



3s 



|a?|r c || 2 + 



3s 



Finally, invoke the hypothesis that 633 < 643 < 0.1. □ 

4.5. Pruning. The final step of each iteration is to prune the intermediate approximation to its 
largest s terms. The following lemma provides a bound on the error in the pruned approximation. 

Lemma 4.5 (Pruning). The pruned approximation b s satisfies 

\\x — b s \\ 2 < 2 1 1 a? — &H2 . 

Proof. The intuition is that b s is close to b, which is close to x. Rigorously, 

\\x — b s \\ 2 < \\x — b\\ 2 + || b — b s \\ 2 < 2 ||a; — fo|| 2 . 

The second inequality holds because b s is the best s-sparse approximation to b. In particular, the 
s-sparse vector a; is a worse approximation. □ 

4.6. Proof of Theorem 4.1. We now complete the proof of the iteration invariant for sparse 
signals, Theorem 4.1. At the end of an iteration, the algorithm forms a new approximation a k = b s , 
which is evidently s-sparse. Applying the lemmas we have established, we easily bound the error: 

\\x — a h \\ 2 = \\x — b s \\ 2 

< 2\\x — b\\ 2 Pruning (Lemma 4.5) 

< 2 • (1.112 ||:t|t C || 2 + 1.06 ||e|| 2 ) Estimation (Lemma 4.4) 

< 2.224 ||r|nc|| 2 + 2.12 ||e|| 2 Support merger (Lemma 4.3) 

< 2.224 • (0.2223 ||r|| 2 + 2.34 ||e|| 2 ) + 2.12 ||e|| 2 Identification (Lemma 4.2) 

< 0.5 ||r|| 2 + 7.5 ||e|| 2 

= 0.5||x - a fe_1 || 2 + 7.5 ||e|| 2 . 
To obtain the second bound in Theorem 4.1, simply solve the error recursion and note that 

(1 + 0.5 + 0.25 + ...)• 7.5 ||e|| 2 < 15 ||e|| 2 . 
This point completes the argument. 

5. Analysis of Iterative Least-squares 

To develop an efficient implementation of CoSaMP, it is critical to use an iterative method when 
we solve the least-squares problem in the estimation step. The two natural choices are Richardson's 
iteration and conjugate gradient. The efficacy of these methods rests on the assumption that the 
sampling operator has small restricted isometry constants. Indeed, since the set T constructed in 
the support merger step contains at most 3s components, the hypothesis 54 S < 0.1 ensures that the 
condition number 

This condition number is closely connected with the performance of Richardson's iteration and 
conjugate gradient. In this section, we show that Theorem 4.1 holds if we perform a constant 
number of iterations of either least-squares algorithm. 
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5.1. Richardson's Iteration. For completeness, let us explain how Richardson's iteration can be 
applied to solve the least-squares problems that arise in CoSaMP. Suppose we wish to compute 
A^u where A is a tall, full-rank matrix. Recalling the definition of the pseudoinverse, we realize 
that this amounts to solving a linear system of the form 

(A*A)b = A*u. 

This problem can be approached by splitting the Gram matrix: 

A* A = I + M 

where M = A* A — I. Given an initial iterate z°, Richardon's method produces the subsequent 
iterates via the formula 

z t+1 = A*u - Mz e . 

Evidently, this iteration requires only matrix- vector multiplies with A and A*. It is worth noting 
that Richardson's method can be accelerated [1, Sec. 7.2.5], but we omit the details. 
It is quite easy to analyze Richardson's iteration [1, Sec. 7.2.1]. Observe that 

\\z e+1 - At u || 2 = \\M(z e - A^u)\\ 2 < \\M\\ \\z e - A*u\\ 2 . 

This recursion delivers 

ll^-A 1 "™^ < ||M|| < ||z°-A t «|| 2 for £ = 0,1,2,.... 

In words, the iteration converges linearly. 

In our setting, A = <&t where T is a set of at most 3s indices. Therefore, the restricted isometry 
inequalities (1.1) imply that 

||M|| = ||*t*t -I|| < S 3s . 
We have assumed that 5s s < 5^ < 0.1, which means that the iteration converges quite fast. 
Once again, the restricted isometry behavior of the sampling matrix plays an essential role in the 
performance of the CoSaMP algorithm. 

Conjugate gradient provides even better guarantees for solving the least-squares problem, but 
it is somewhat more complicated to describe and rather more difficult to analyze. We refer the 
reader to [1, Sec. 7.4] for more information. The following lemma summarizes the behavior of both 
Richardson's iteration and conjugate gradient in our setting. 

Lemma 5.1 (Error Bound for LS). Richardson's iteration produces a sequence {z e } of iterates that 
satisfy 

\\z e - & T u\\ 2 < 0.1 e ■ \\z° - <f> j T u\\ 2 for £ = 0,1,2,.... 
Conjugate gradient produces a sequence of iterates that satisfy 

\\z e - $ j T u\\ 2 < 2 • p e • \\z° - & T u\\ 2 for £ = 0,1,2,.... 

where 

= y v t < o 072. 

y/K(3>* T 3> T ) + 1 - 

5.2. Initialization. Iterative least-squares algorithms must be seeded with an initial iterate, and 
their performance depends heavily on a wise selection thereof. CoSaMP offers a natural choice for 
the initializer: the current signal approximation. As the algorithm progresses, the current signal 
approximation provides an increasingly good starting point for solving the least-squares problem. 

Lemma 5.2 (Initial Iterate for LS). Let x be an s-sparse signal with noisy samples u = &x + e. 
Let a k ~ 1 be the signal approximation at the end of the (k — l)th iteration, and let T be the set of 
components identified by the support merger. Then 

\\a k - 1 - *^w|| 2 < 2.112||aj - a fc_1 || 2 + 1-06 ||e|| 2 
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Proof. By construction of T, the approximation a k ~ l is supported inside T, so 

IHH| 2 = ||(a5 - a^Mla < \\x - a fc_1 || 2 - 
Using Lemma 4.4, we may calculate how far a fc_1 lies from the solution to the least-squares problem. 

|| O.^ -1 - *r U || 2 - \\ X ~ + \\ X ~ *T M ll2 

< ||x - a fe_1 || 2 + 1.112||a;|rc|| 2 + 1-06 ||e|| 2 

< 2.112||a;-a /c - 1 || 2 + 1.06||e|| 2 . 

Roughly, the error in the initial iterate is controlled by the current approximation error. □ 

5.3. Iteration Count. We need to determine how many iterations of the least-squares algorithm 
are required to ensure that the approximation produced is sufficiently good to support the perfor- 
mance of CoSaMP. 

Corollary 5.3 (Estimation by Iterative LS). Suppose that we initialize the LS algorithm with 
z° = a k ~ l . After at most three iterations, both Richardson's iteration and conjugate gradient 
produce a signal estimate b that satisfies 

\\x - b\\ 2 < 1.112||£c| T c|| 2 + 0.0022||cc - a fe_1 || 2 + 1.062 ||e|| 2 . 

Proof. Combine Lemma 5.1 and Lemma 5.2 to see that three iterations of Richardson's method 
yield 

||z 3 - $ j T u\\ 2 < 0.002112||cc - a fe_1 || 2 + 0.00106 ||e|| 2 . 

The bound for conjugate gradient is slightly better. Let b\x = z 3 . According to the estimation 
result, Lemma 4.4, we have 

\\x - & T u\\ 2 < 1.112||x|tc|| + 1.06 ||e|| 2 . 
An application of the triangle inequality completes the argument. □ 

5.4. CoSaMP with Iterative least-squares. Finally, we need to check that the sparse iteration 
invariant, Theorem 4.1 still holds when we use an iterative least-squares algorithm. 

Theorem 5.4 (Sparse Iteration Invariant with Iterative LS). Suppose that we use Richardson's 
iteration or conjugate gradient for the estimation step, initializing the LS algorithm with the current 
approximation a k ~ 1 and performing three LS iterations. Then Theorem 4-1 still holds. 

Proof. We repeat the calculation in Section 4.6 using Corollary 5.3 instead of the simple estimation 
lemma. To that end, recall that the residual r = x — a k ~ 1 . Then 

\\x - a k \\ 2 < 2 \\x - b\\ 2 

< 2 • (1.112 ||£c| T c|| 2 + 0.0022 ||r|| 2 + 1.062 ||e|| 2 ) 

< 2.224 ||r|oc|| 2 + 0.0044 ||r|| 2 + 2.124 ||e|| 2 

< 2.224 • (0.2223 ||r|| 2 + 2.34 ||e|| 2 ) + 0.0044 ||r|| 2 + 2.124 ||e|| 2 

< 0.5 ||r|| 2 + 7.5 ||e|| 2 

= 0.5 1| a; - a fc_1 || 2 + 7.5 ||e|| 2 . 

This bound is precisely what is required for the theorem to hold. □ 
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6. Extension to General Signals 



In this section, we finally complete the proof of the main result for CoSaMP, Theorem 2.1. The 
remaining challenge is to remove the hypothesis that the target signal is sparse, which we framed in 
Theorems 4.1 and 5.4. Although this difficulty might seem large, the solution is simple and elegant. 
It turns out that we can view the noisy samples of a general signal as samples of a sparse signal 
contaminated with a different noise vector that implicitly reflects the tail of the original signal. 



The sample vector 



| 2 < 1.05 



\x — x 



s II 2 /— \\^ II 1 



+ e 



2 • 



Lemma 6.1 (Reduction to Sparse Case). Let x be an arbitrary vector in C N . 
u = <&x + e can also be expressed as u = <&x s + e where 

1 

Proof. Decompose x = x s + (x — x s ) to obtain u = &x s + e where e = <&(x — x s ) + e. To compute 
the size of the error term, we simply apply the triangle inequality and Proposition 3.5: 

1 

^s|l2 "T" /— 11*^ t °s|ll 



|e|| 2 < \A + 5 S 



+ e 



Finally, invoke the fact that 5 S < $As < 0.1 to obtain y/l + S s < 1.05. 

This lemma is just the tool we require to complete the remaining argument. 



□ 



Proof of Theorem 2.1. Let x be a general signal, and use Lemma 6.1 to write the noisy vector of 
samples u = &x s + e. Apply the sparse iteration invariant, Theorem 4.1, or the analog for iterative 
least-squares, Theorem 5.4. We obtain 



x* 



a 



fc+i I 



< 0.5 \\x s - a fc L + 7.5 llel 



12 - - 112 

Invoke the lower and upper triangle inequalities to obtain 



2 • 



X 



a 



k+l\ 



< 0.5 cc - or L + 7.5 ||e|| 9 + 1.5 \\x 



l 2 ^ || 2 i , M w||2 -r j-.o x s\\2 

Finally, recall the estimate for ||e|| 2 from Lemma 6.1, and simplify to reach 

fell oo^r,, II 7 - 875 

a + 9.375 \\x — x s \\ 2 H — \\x — x x 



\x 



a 



k+1 \\ 2 < 0.5\\x 
< 0.5\\x 



i + 7-5 ||e| 



o* L + lOi/. 



where v is the unrecoverable energy (2.1). 



□ 



7. Discussion and Related Work 

CoSaMP draws on both algorithmic ideas and analytic techniques that have appeared before. 
This section describes the other major signal recovery algorithms, and it compares them with 
CoSaMP. It also attempts to trace the key ideas in the algorithm back to their sources. 

7.1. Algorithms for Compressive Sampling. We begin with a short discussion of the major 
algorithmic approaches to signal recovery from compressive samples. We focus on provably correct 
methods, although we acknowledge that some ad hoc techniques provide excellent empirical results. 

The initial discovery works on compressive sampling proposed to perform signal recovery by 
solving a convex optimization problem [2, 12]. Given a sampling matrix $ and a noisy vector of 
samples u = + e, consider the mathematical program 

minllyllj subject to <&y = u. (7.1) 

In words, we look for a signal reconstruction that is consistent with the samples but has minimal 
i\ norm. The intuition behind this approach is that minimizing the i\ norm promotes sparsity, so 
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allows the approximate recovery of compressible signals. Candes, Romberg, and Tao established 
in [3] that a minimizer a of (7.1) satisfies 



provided that the sampling matrix $ has restricted isometry constant 5^ s < 0.2. In [4], the 
hypothesis on the restricted isometry constant is sharpened to 82s < \/2 — 1- The error bound for 
CoSaMP is equivalent, modulo the exact value of the constants. 

The literature describes a huge variety of algorithms for solving the optimization problem (7.1). 
The most common approaches involve interior-point methods [2, 24], projected gradient meth- 
ods [14], or iterative thresholding [11] The interior-point methods are guaranteed to solve the 
problem to a fixed precision in time 0(m 2 iV L5 ), where m is the number of measurements and N is 
the signal length [29] . Note that the constant in the big-0 notation depends on some of the problem 
data. The other convex relaxation algorithms, while sometimes faster in practice, do not currently 
offer rigorous guarantees. CoSaMP provides rigorous bounds on the runtime that are much better 
than the available results for interior-point methods. 

Tropp and Gilbert proposed the use of a greedy iterative algorithm called orthogonal matching 
pursuit (OMP) for signal recovery [36]. The algorithm initializes the current sample vector v = u. 
In each iteration, it forms the signal proxy y = <&*v and identifies a component of the proxy with 
largest magnitude. It adds the new component to the set T of previously identified components. 
Then OMP forms a new signal approximation by solving a least-squares problem: a = ^ T u. 
Finally, it updates the samples v = u — 3>a. These steps are repeated until a halting criterion is 
satisfied. 

Tropp and Gilbert were able to prove a weak result for the performance of OMP [36] . Suppose 
that x is a fixed, s-sparse signal, and let m = Cs log N. Draw an m x N sampling matrix <1> whose 
entries are independent, zero-mean subgaussian 2 random variables with equal variances. Given 
noiseless measurements u = *&x, OMP reconstructs x after s iterations, except with probability 
iV -1 . In this setting, OMP must fail for some sparse signals, so it does not provide the same 
uniform guarantees as convex relaxation. It is unknown whether OMP succeeds for compressible 
signals or whether it succeeds when the samples are contaminated with noise. 

Donoho et al. invented another greedy iterative method called stagewise OMP, or StOMP [13]. 
This algorithm uses the signal proxy to select multiple components at each step, using a rule 
inspired by ideas from wireless communications. The algorithm is faster than OMP because of 
the selection rule, and it sometimes provides good performance, although parameter tuning can be 
difficult. There are no rigorous results available for StOMP. 

Very recently, Needed and Vershynin developed and analyzed another greedy approach, called 
regularized OMP, or ROMP [28, 27] . This algorithm is similar to OMP but uses a more sophisticated 
selection rule. Among the s largest entries of the signal proxy, it identifies the largest subset 
whose entries differ in magnitude by at most a factor of two. The work on ROMP represents an 
advance because the authors establish under restricted isometry hypotheses that their algorithm 
can approximately recover any compressible signal from noisy samples. More precisely, suppose that 
the sampling matrix $ has restricted isometry constant 5$ s < O.Ol/ydog s. Given noisy samples 
u = &x + e, ROMP produces a 2s-sparse signal approximation a that satisfies 



This result is comparable with the result for convex relaxation, aside from the extra logarithmic 
factor in the restricted isometry hypothesis and the error bound. The results for CoSaMP show 
that it does not suffer these parasitic factors, so its performance is essentially optimal. 

A subgaussian random variable Z satisfies P {\Z\ > t} < ce~ c for all t > 0. 



x — a\\ 2 < C — jz \\x — 




Hi 11^112 



(7.2) 



x — a\\ 2 < Cy / logs 



—j= \\X "Eslll + ||<2||2 
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After we initially presented this work, Dai and Milenkovic developed an algorithm called Subspace 
Pursuit that is very similar to CoSaMP. They established that their algorithm offers performance 
guarantees analogous with those for CoSaMP. See [10] for details. 

Finally, we note that there is a class of sublinear algorithms for signal reconstruction from 
compressive samples. A sublinear algorithm uses time and space resources that are asymptotically 
smaller than the length of the signal. One of the earliest such techniques is the Fourier sampling 
algorithm of Gilbert et al. [19, 21]. This algorithm uses random (but structured) time samples to 
recover signals that are compressible with respect to the discrete Fourier basis. Given spolylog(A^) 
samples 3 , Fourier sampling produces a signal approximation a that satisfies 



except with probability iV -1 . The result for Fourier sampling holds for each signal (rather than for 
all). Later, Gilbert et al. developed two other sublinear algorithms, chaining pursuit [16] and HHS 
pursuit [17], that offer uniform guarantees for all signals. Chaining pursuit has an error bound 



which is somewhat worse than (7.2). HHS pursuit achieves the error bound (7.2). These methods 
all require more measurements than the linear and superlinear algorithms (by logarithmic factors), 
and these measurements must be highly structured. As a result, the sublinear algorithms may not 
be useful in practice. 

The sublinear algorithms are all combinatorial in nature. They use ideas from group testing to 
identify the support of the signal quickly. There are several other combinatorial signal recovery 
methods due to Cormode-Muthukrishnan [9] and Iwen [22]. These algorithms have drawbacks 
similar to the sublinear approaches. 

7.2. Relative Performance. Table 2 summarizes the relative behavior of these algorithms in 

terms of the following criteria. 

General samples: Does the algorithm work for a variety of sampling schemes? Or does it 
require structured samples? The designation "RIP" means that a bound on a restricted 
isometry constant suffices. "Subgauss." means that the algorithm succeeds for the class of 
subgaussian sampling matrices. 

Optimal number of samples: Can the algorithm recover s sparse signals from O(slogiV) 
measurements? Or are its sampling requirements higher (by logarithmic factors)? 

Uniformity: Does the algorithm recover all signals given a fixed sampling matrix? Or do the 
results require a sampling matrix to be drawn at random for each signal? 

Stability: Does the algorithm succeed when (a) the signal is compressible but not sparse and 
(b) when the samples are contaminated with noise? In most cases, stable algorithms have 
error bounds similar to (7.2). See the discussion above for details. 

Running time: What is the worst-case cost of the algorithm to recover a real- valued s-sparse 
signal to a fixed relative precision, given a sampling matrix with no special structure? The 
designation LP(iV, m) indicates the cost of solving a linear program with N variables 
and m constraints, which is 0(m 2 N 1 ' 5 ) for an interior-point method. Note that most of 
the algorithms can also take advantage of fast matrix-vector multiplies to obtain better 
running times. 

Of the linear and superlinear algorithms, CoSaMP achieves the best performance on all these 
metrics. Although CoSaMP is slower than the sublinear algorithms, it makes up for this shortcoming 
by allowing more general sampling matrices and requiring fewer samples. 



x ~ a \\2 — C \\ x ~ x 




x — < ClogiV \\x — x 




'The term polylog indicates a function that is dominated by a polynomial in the logarithm of its argument. 



22 



D. NEEDELL AND J. A. TROPP 





CoSaMP 


OMP 


ROMP 


Convex opt. 


Fourier Samp. 


HHS Pursuit 


General samples 


RIP 


Subgauss. 


RIP 


RIP 


no 


no 


Opt. # samples 


yes 


yes 


no 


yes 


no 


no 


Uniformity 


yes 


no 


yes 


yes 


no 


yes 


Stability 


yes 


? 


yes 


yes 


yes 


yes 


Running time 


O(miV) 


O(smN) 


O(smN) 


LP(iV, m) 


s polylog(iV) 


poly (s log N) 



Table 2. Comparison of several signal recovery algorithms. The notation s refers 
to the sparsity level; m refers the number of measurements; N refers to the signal 
length. See the text for comments on specific designations. 



7.3. Key Ideas. We conclude with a historical overview of the ideas that inform the CoSaMP 
algorithm and its analysis. 

The overall greedy iterative structure of CoSaMP has a long history. The idea of approaching 
sparse approximation problems in this manner dates to the earliest algorithms. In particular, 
methods for variable selection in regression, such as forward selection and its relatives, all take this 
form [26] . Temlyakov's survey [33] describes the historical role of greedy algorithms in nonlinear 
approximation. Mallat and Zhang introduced greedy algorithms into the signal processing literature 
and proposed the name matching pursuit [25]. Gilbert, Strauss, and their collaborators showed how 
to incorporate greedy iterative strategies into fast algorithms for sparse approximation problems, 
and they established the first rigorous guarantees for greedy methods [18, 19]. Tropp provided a 
new theoretical analysis of OMP in his work [34]. Subsequently, Tropp and Gilbert proved that 
OMP was effective for compressive sampling [36]. 

Unlike the simplest greedy algorithms, CoSaMP identifies many components during each itera- 
tion, which allows the algorithm to run faster for many types of signals. It is not entirely clear 
where this idea first appeared. Several early algorithms of Gilbert et al. incorporate this approach 
[20, 37], and it is an essential feature of the Fourier sampling algorithm [19, 21]. More recent com- 
pressive sampling recovery algorithms also select multiple indices, including chaining pursuit [16], 
HHS pursuit [17], StOMP [13], and ROMP [28]. 

CoSaMP uses the restricted isometry properties of the sampling matrix to ensure that the iden- 
tification step is successful. Candes and Tao isolated the restricted isometry conditions in their 
work on convex relaxation methods for compressive sampling [5]. The observation that restricted 
isometries can also be used to ensure the success of greedy methods is relatively new. This idea 
plays a role in HHS pursuit [17], but it is expressed more completely in the analysis of ROMP [28]. 

The pruning step of CoSaMP is essential to maintain the sparsity of the approximation, which 
is what permits us to use restricted isometries in the analysis of the algorithm. It also has signifi- 
cant ramifications for the running time because it impacts the speed of the iterative least-squares 
algorithms. This technique originally appeared in HHS pursuit [17]. 

The iteration invariant, Theorem 2.1, states that if the error is large then CoSaMP makes sub- 
stantial progress. This approach to the overall analysis echoes the analysis of other greedy iterative 
algorithms, including the Fourier sampling method [19, 21] and HHS Pursuit [17]. 

Finally, mixed-norm error bounds, such as that in Theorem A, have become an important feature 
of the compressive sampling literature. This idea appears in the work of Candes-Romberg-Tao 
on convex relaxation [3]; it is used in the analysis of HHS pursuit [17]; it also plays a role in the 
theoretical treatment of Cohen-Dahmen-DeVore [7] . 

Acknowledgment. We would like to thank Martin Strauss for many inspiring discussions. He is 
ultimately responsible for many of the ideas in the algorithm and analysis. We would also like to 
thank Roman Vershynin for suggestions that drastically simplified the proofs. 
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Appendix A. Algorithmic Variations 

This appendix describes other possible halting criteria and their consequences. It also proposes 
some other variations on the algorithm. 

A.l. Halting Rules. There are three natural approaches to halting the algorithm. The first, 
which we have discussed in the body of the paper, is to stop after a fixed number of iterations. 
Another possibility is to use the norm ||v|| 2 of the current samples as evidence about the norm 
||r|| 2 of the residual. A third possibility is to use the magnitude WyW^ of the entries of the proxy 
to bound the magnitude Wr]]^ of the entries of the residual. 

It suffices to discuss halting criteria for sparse signals because Lemma 6.1 shows that the general 
case can be viewed in terms of sampling a sparse signal. Let x be an s-sparse signal, and let a be 
an s-sparse approximation. The residual r = x — a. We write v = $r + e for the induced noisy 
samples of the residual and y = &*v for the signal proxy. 

The discussion proceeds in two steps. First, we argue that an a priori halting criterion will result 
in a guarantee about the quality of the final signal approximation. 

Theorem A.l (Halting I). The halting criterion \\v\\ 2 < e ensures that 

\\x — a\\ 2 < 1.06 • (e + ||e|| 2 ). 

The halting criterion WyW^ < rj/y/2s ensures that 

\\x - aW^ < 1.1277 + 1-17 || 6 1| 2 . 
Proof. Since r is 2s-sparse, Proposition 3.1 ensures that 

\A — 5 2s \\r\\ 2 — ||e|| 2 < |M| 2 ■ 

If || v\\ 2 < e, it is immediate that 

ML < 



2 " y/l^T 2 s 

The definition r = x — a and the numerical bound 5 2s < 5^ s < 0.1 dispatch the first claim. 
Let R = supp(r), and note that |i?| < 2s. Proposition 3.1 results in 

(1 - S 2s ) \\r\\ 2 - y/l + 5 2s ||e|| 2 < ||y|fl|| 2 . 

Since 

||y|i?|| 2 < v 7 ^ Hylfllloo < V2s~ Hvlloo , 

we find that the requirement WyW^ < r)/V2s ensures that 

II || < r/ + ^1 + J 2s ||e|| 2 
" r|lo °- l-6 2s 
The numerical bound 5 2s < 0.1 completes the proof. 

□ 

Second, we check that each halting criterion is triggered when the residual has the desired 
property. 

Theorem A. 2 (Halting II). The halting criterion \\v\\ 2 < e is triggered as soon as 

\\x — a\\ 2 < 0.95 • (e — ||e|| 2 ). 
The halting criterion WyW^ < r\j\pls is triggered as soon as 

0.45r/ 0.68||e|L 
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Proof. Proposition 3.1 shows that 

IMI2 - V 1 + $2s \\r\\ 2 + 

Therefore, the condition 



|e|| 2 



Ml 2 < 



£ 



ensures that ||v|| 2 < £■ Note that 82s < 0.1 to complete the first part of the argument. 

Now let R be the singleton containing the index of a largest-magnitude coefficient of y. Propo- 
sition 3.1 implies that 

Ill/Hoc = WvWh < y/l + 81 IMI 2 - 

By the first part of this theorem, the halting criterion WyW^ < rj/V^s is triggered as soon as 

\\x - a\L < 0.95 • ( 71 - llel 
Since x — a is 2s-sparse, we have the bound \\x — a\\ 2 < \/2s \\x — <x || ^ . To wrap up, recall that 

<*1 < ha < 0.1. " □ 

A. 2. Other Variations. This section briefly describes several natural variations on CoSaMP that 
may improve its performance. 

(1) Here is a version of the algorithm that is, perhaps, simpler than Algorithm 2.1. At each 
iteration, we approximate the current residual rather than the entire signal. This approach 
is similar to HHS Pursuit [17]. The inner loop changes in the following manner. 

Identification: As before, select 12 = supp(y2s)- 

Estimation: Solve a least-squares problem with the current samples instead of the orig- 
inal samples to obtain an approximation of the residual signal. Formally, b = In 
this case, one initializes the iterative least-squares algorithm with the zero vector to 
take advantage of the fact that the residual is becoming small. 

Approximation Merger: Add this approximation of the residual to the previous ap- 
proximation of the signal to obtain a new approximation of the signal: c = a fc_1 + b. 

Pruning: Construct the s-sparse signal approximation: a k = c s . 

Sample Update: Update the samples as before: v = u — 3>a fe . 

We can show that this algorithm satisfies a result similar to Theorem 2.1 by adapting the 
current argument. We were unable to verify an analog of Theorem 2.2. Nevertheless, we 
believe that this version of the algorithm is also promising. 

(2) After the inner loop of the algorithm is complete, we can solve another least-squares problem 
in an effort to improve the final result. If a is the approximation at the end of the loop, 
we set T = supp(a). Then solve b = & T u and output the s-sparse signal approximation b. 
Note that the output approximation is not guaranteed to be better than a because of the 
noise vector e, but it should never be much worse. 

(3) Another variation is to prune the merged support T down to s entries before solving the 
least-squares problem. One may use the values of the proxy y as surrogates for the unknown 
values of the new approximation on the set O. Since the least-squares problem is solved at 
the end of the iteration, the columns of $ that are used in the least-squares approximation 
are orthogonal to the current samples v. As a result, the identification step always selects 
new components in each iteration. We have not attempted an analysis of this algorithm. 
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(a) Low profile (b) High profile 

Figure 1. Illustration of two unit-norm signals with sharply different profiles. 



Appendix B. Iteration Count 

In this appendix, we obtain an estimate on the number of iterations of the CoSaMP algorithm 
necessary to identify the recoverable energy in a sparse signal, assuming exact arithmetic. Except 
where stated explicitly, we assume that x is s-sparse. It turns out that the number of iterations 
depends heavily on the signal structure. Let us explain the intuition behind this fact. 

When the entries in the signal decay rapidly, the algorithm must identify and remove the largest 
remaining entry from the residual before it can make further progress on the smaller entries. Indeed, 
the large component in the residual contaminates each component of the signal proxy. In this case, 
the algorithm may require an iteration or more to find each component in the signal. 

On the other hand, when the s entries of the signal are comparable, the algorithm can simulta- 
neously locate many entries just by reducing the norm of the residual below the magnitude of the 
smallest entry. Since the largest entry of the signal has magnitude at least s _1//2 times the £2 norm 
of the signal, the algorithm can find all s components of the signal after about logs iterations. 

To quantify these intuitions, we want to collect the components of the signal into groups that 
are comparable with each other. To that end, define the component bands of a signal x by the 
formulae 



B, 



i : 2 



-O+i) 



< 2-3 



x 



0,1,2, 



(B.l) 



j I for j 

The profile of the signal is the number of bands that are nonempty: 

profile^) d = f #{j : Bj + 0}. 

In words, the profile counts how many orders of magnitude at which the signal has coefficients. It 
is clear that the profile of an s-sparse signal is at most s. See Figure 1 for images of stylized signals 
with different profiles. 

First, we prove a result on the number of iterations needed to acquire an s-sparse signal. At the 
end of the section, we extend this result to general signals, which yields Theorem 2.2. 

Theorem B.l (Iteration Count: Sparse Case). Let x be an s-sparse signal, and define p = 
profile(a;). After at most 

P l og i/3 {l + 4.6 v / s/p) + 6 
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iterations, CoSaMP produces an approximation a that satisfies 

\\x — ct 1 1 2 < 17 ||e|| 2 . 

For a fixed s, the bound on the number of iterations achieves its maximum value at p = s. Since 
lo g 4/ / 3 5.6 < 6, the number of iterations never exceeds 6(s + 1). 

B.l. Additional Notation. Let us instate some notation that will be valuable in the proof of the 
theorem. We write p = profile(a:). For each fc = 0, 1, 2, ... , the signal a k is the approximation after 
the fcth iteration. We abbreviate Sk = supp(a fc ), and we define the residual signal r k = x — a k . 
The norm of the residual can be viewed as the approximation error. 
For a nonnegative integer j, we may define an auxiliary signal 

i dcf I 

y = x h>,B t 

In other words, y J is the part of x contained in the bands Bj, Bj+i, Bj + 2, .... For each j G J, we 
have the estimate 

by definition of the bands. These auxiliary signals play a key role in the analysis. 

B.2. Proof of Theorem B.l. The proof of the theorem involves a sequence of lemmas. The 
first object is to establish an alternative that holds in each iteration. One possibility is that the 
approximation error is small, which means that the algorithm is effectively finished. Otherwise, 
the approximation error is dominated by the energy in the unidentified part of the signal, and the 
subsequent approximation error is a constant factor smaller. 

Lemma B.2. For each iteration k = 0,1,2, ... , at least one of the following alternatives holds. 
Either 

\\r k \\ 2 < 70||e|| 2 (B.3) 

or else 

\\r k L < 2.3||a;|sf|L and (B.4) 

\\ \\ Z II 1 k II 2 

\\r k+1 \\ 2 < 0.75||r fe || 2 . (B.5) 

Proof. Define as the merged support that occurs during iteration k. The pruning step ensures 
that the support Sk of the approximation at the end of the iteration is a subset of the merged 
support, so 

||a;|T fc c || 2 < |Hs£|| 2 for fc = 1,2,3, 

At the end of the kth iteration, the pruned vector b s becomes the next approximation a k , so the 
estimation and pruning results, Lemmas 4.4 and 4.5, together imply that 

||r fc || 2 <2-(1.112||aj|^|| 2 + 1.06||c|| 2 ) 

< 2.224||£c| 5 c|| 2 + 2.12 ||e|| 2 for fc = 1,2, 3, (B.6) 

Note that the same relation holds trivially for iteration fc = because r° = x and So = 0. 
Suppose that there is an iteration fc > where 

|Hsc|| 2 <30||e|| 2 . 

We can introduce this bound directly into the inequality (B.6) to obtain the first conclusion (B.3). 
Suppose on the contrary that in iteration k we have 
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Introducing this relation into the inequality (B.6) leads quickly to the conclusion (B.4). We also 
have the chain of relations 

|K|| 2 > ||r- fc y| 2 = ||(a: - a k )\ S c\\ 2 = \\x\s* k \\ 2 > 30 ||e|| 2 . 

Therefore, the sparse iteration invariant, Theorem 4.1 ensures that (B.5) holds. □ 

The next lemma contains the critical part of the argument. Under the second alternative in the 
previous lemma, we show that the algorithm completely identifies the support of the signal, and 
we bound the number of iterations required to do so. 



Lemma B.3. Fix K = |plog 4 / 3 (l + 4.6y s/p)\ . Assume that (B.4) and (B.5) are in force for each 
iteration k = 0, 1, 2, . . . , K. Then supp(a K ) = supp(a;). 

Proof First, we check that, once the norm of the residual is smaller than each element of a band, 
the components in that band persist in the support of each subsequent approximation. Define J to 
be the set of nonempty bands, and fix a band j G J. Suppose that, for some iteration k, the norm 
of the residual satisfies 

|| r *|| 2 < 2 -0'+ 1 )/ 2 ||a;|| 2 . (B.7) 

Then it must be the case that Bj C supp(a fc ). If not, then some component i € Bj appears in the 
residual: r\ = X{. This supposition implies that 



|r fe |L > \xi\ > 2-^l 2 \\x\ 



2 ■ 



an evident contradiction. Since (B.5) guarantees that the norm of the residual declines in each 
iteration, (B.7) ensures that the support of each subsequent approximation contains Bj. 

Next, we bound the number of iterations required to find the next nonempty band Bj, given 
that we have already identified the bands Bi where i < j. Formally, assume that the support Sk of 
the current approximation contains B^ for each % < j. In particular, the set of missing components 
S| C supp(y- ? ). It follows from relation (B.4) that 

||r fc || 2 <2.3||^|| 2 . 
We can conclude that we have identified the band Bj in iteration k + 1 if 

|| r fc+£|| < 2 -(j+l)/2 ii || 

According to (B.5), we reduce the error by a factor of fi^ 1 = 0.75 during each iteration. Therefore, 
the number t of iterations required to identify Bj is at most 



2.3 ja 



■j 



2 -(j+i)/2 na,^ 



We discover that the total number of iterations required to identify all the (nonempty) bands is at 
most 

2(J+i)/2 llyJll 



k * = Y, jeJ l °Zf3 



2.3 



\ x \\2 



For each iteration k > [k k \, it follows that supp(a fc ) = supp(a;). 

It remains to bound k± in terms of the profile p of the signal. For convenience, we focus on a 
slightly different quantity. First, observe that p = \J\. Using the geometric mean-arithmetic mean 
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inequality, we discover that 



exp 



-Y log 

p ^jeJ 



2.3- 



2 °' +1)/2 IMI 2 



\x\ 



2(J +1 )/ 2 M| 



33 



p ^jcJ 



1 + 2.3 



2 °' +1)/2 |k'H 2 X 



33 



/* +1 IMI*V /2 



33 



To bound the remaining sum, we recall the relation (B.2). Then we invoke Jensen's inequality and 
simplify the result. 



ii ii i ) -pEjgj^ Ei>j 2 - ^Ejej 25 E 2 j 



1/2 



<fly i^iy 2^ +i V /2 <f-y \Bi 

= 2 v / i/p. 



1/2 



The final equality holds because the total number of elements in all the bands equals the signal 
sparsity s. Combining these bounds, we reach 



exp < - los 
| p ^jeJ 



2.3 



2 °' +1)/2 ||y j || 2 ' 



33 



< l + 4.6^/s/p. 



Take logarithms, multiply by p, and divide through by log 0. We conclude that the required number 
of iterations k+ is bounded as 



h < plog«(l + 4.6^/s/p). 



This is the advertised conclusion. 



□ 



Finally, we check that the algorithm produces a small approximation error within a reasonable 
number of iterations. 

Proof of Theorem B.l. Abbreviate K = |_plog(l + 4.6y/s/p)\ . Suppose that (B.3) never holds 
during the first K iterations of the algorithm. Under this circumstance, Lemma B.2 implies that 
both (B.4) and (B.5) hold during each of these K iterations. It follows from Lemma B.3 that 
the support Sk of the Kth approximation equals the support of 33. Since Sk is contained in the 
merged support Tk, we see that the vector 33 = 0. Therefore, the estimation and pruning 
results, Lemmas 4.4 and 4.5, show that 

\\r K \\ 2 < 2 • (1.112 \\x\ T £ || 2 + 1-06 ||e|| 2 ) = 2.12 ||e|| 2 . 

This estimate contradicts (B.3). 

It follows that there is an iteration k < K where (B.3) is in force. Repeated applications of the 
iteration invariant, Theorem 4.1, allow us to conclude that 



|r x+6 || 2 <17||e| 



2 • 



This point completes the argument. 



□ 
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< 18.9 ||x — aj s || 2 H \\x — x s \\i + 17 \\e\\ 2 



B.3. Proof of Theorem 2.2. Finally, we extend the sparse iteration count result to the general 
case. 

Theorem B.4 (Iteration Count). Let x be an arbitrary signal, and define p = pronle(a: s ). After 
at most 

P l og 4/3 (l + 4.6 v / s/p) + 6 
iterations, CoSaMP produces an approximation a that satisfies 

\\x — a\\ 2 < 20 ||e|| 2 • 

Proof. Let a; be a general signal, and let p = profilers). Lemma 6.1 allows us to write the noisy 
vector of samples u = *&x s + e. The sparse iteration count result, Theorem B.l, states that after 
at most 

P l og 4/3 (l + 4.6a/s7p) + 6 
iterations, the algorithm produces an approximation a that satisfies 

\\x s — ot. 1 1 2 < 17 ||e|| 2 . 

Apply the lower triangle inequality to the left-hand side. Then recall the estimate for the noise in 
Lemma 6.1, and simplify to reach 

1 1 *^ ^ II 2 — ^ 1 1 ^ 1 1 2 1 1 *^ s 1 1 2 

17.9 

< 20i/, 

where v is the unrecoverable energy. □ 

Proof of Theorem 2.2. Invoke Theorem B.4. Recall that the estimate for the number of iterations 
is maximized with p = s, which gives an upper bound of 6(s + 1) iterations, independent of the 
signal. □ 
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